Skip to content

Conversation

Julien-Ben
Copy link
Collaborator

@Julien-Ben Julien-Ben commented Oct 1, 2025

CLOUDP-347497 - Single cluster Replica Set Controller Refactoring

Why this refactoring

The single-cluster RS controller was mixing two concerns:

  • Kubernetes stuff (StatefulSets, pods, volumes)
  • Ops Manager/MongoDB stuff (MongoDB processes, replication config)

This worked fine for single-cluster, but it's a problem when you think about multi-cluster:

  • Multi-cluster has multiple StatefulSets (one per cluster) but only one logical ReplicaSet in Ops Manager
  • The OM automation config doesn't care about how many K8s clusters you have or how the pods are deployed

So we need to separate these layers properly.

Main changes

1. Broke down the huge Reconcile() method

Before: ~300 lines of inline logic in Reconcile()

Now:

Reconcile()
  ├── reconcileMemberResources()        // Handles all K8s resource creation
  │   ├── reconcileHostnameOverrideConfigMap()
  │   ├── ensureRoles()
  │   └── reconcileStatefulSet()        // StatefulSet-specific logic isolated here
  │       └── buildStatefulSetOptions() // Builds STS configuration
  └── updateOmDeploymentRs()            // Handles Ops Manager automation config updates

This makes it way easier to understand what's happening and matches the multi-cluster controller structure.

2. Removed StatefulSet dependency from OM operations

Created new helper functions that work directly with MongoDB resources instead of StatefulSets:

  • CreateMongodProcessesFromMongoDB() - was using StatefulSet before
  • BuildFromMongoDBWithReplicas() - same
  • WaitForRsAgentsToRegisterByResource() - same

These mirror the existing ...FromStatefulSet functions but take MongoDB resources instead.

Why it matters: The OM layer now only cares about the MongoDB resource definition, not how it's deployed in K8s. This makes the code work the same way for both single-cluster and multi-cluster.

3. Added publishAutomationConfigFirstRS checks

Dedicated function for RS instead of using the shared one. Does not rely on a statefulset object.

Important for review

The ideal way to review this PR is to compare the new structure to the mongodbmultireplicaset_controller.go. The aim of the refactoring is to get the single cluster controller closer to it.

Look at:

  • reconcileMemberResources() in both controllers - similar structure now
  • updateOmDeploymentRs() - no more StatefulSet dependency
  • New helper functions in om/process and om/replicaset - mirror existing patterns

Bug found along the way

The logic to handle scale up + disable TLS at the same time doesn't actually work properly and should be blocked by validation (see draft PR #490 for more details).

Tests added

Added tests for the new helper functions:

  • TestCreateMongodProcessesFromMongoDB - basic scenarios, scaling, custom domains, TLS, additional config
  • TestBuildFromMongoDBWithReplicas - integration test checking ReplicaSet structure and member options propagation
  • TestPublishAutomationConfigFirstRS - automation config publish logic with various TLS/auth scenarios

@github-actions
Copy link

github-actions bot commented Oct 1, 2025

⚠️ (this preview might not be accurate if the PR is not rebased on current master branch)

MCK 1.6.0 Release Notes

New Features

  • MongoDBCommunity: Added support to configure custom cluster domain via newly introduced spec.clusterDomain resource field. If spec.clusterDomain is not set, environment variable CLUSTER_DOMAIN is used as cluster domain. If the environment variable CLUSTER_DOMAIN is also not set, operator falls back to cluster.local as default cluster domain.
  • Helm Chart: Introduced two new helm fields operator.podSecurityContext and operator.securityContext that can be used to configure securityContext for Operator deployment through Helm Chart.

Bug Fixes

  • Fixed parsing of the customEnvVars Helm value when values contain = characters.

@Julien-Ben Julien-Ben added the skip-changelog Use this label in Pull Request to not require new changelog entry file label Oct 7, 2025
Builder functions
Test works
Copy link
Contributor

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Early review: I had a quick-ish look at the controllers and couldn't spot anything obviously wrong with it. I didn't review the tests at all.

Looks good so far.

}
}

publishAutomationConfigFirst, err := r.publishAutomationConfigFirstMultiCluster(ctx, &mrs, log)
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Brought it closer to where it's actually used

log.Debugw("Marked replica set members as non-voting", "replica set with members", rsMembers)
}

// TODO practice shows that automation agents can get stuck on setting db to "disabled" also it seems that this process
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This comment was old (<2022)

@Julien-Ben Julien-Ben marked this pull request as ready for review October 7, 2025 16:27
@Julien-Ben Julien-Ben requested a review from a team as a code owner October 7, 2025 16:27
// Reconcile reads that state of the cluster for a MongoDbReplicaSet object and makes changes based on the state read
// and what is in the MongoDbReplicaSet.Spec
func (r *ReconcileMongoDbReplicaSet) Reconcile(ctx context.Context, request reconcile.Request) (res reconcile.Result, e error) {
// === 1. Initial Checks and setup
Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I added these as helpers for myself at the beginning of the refactoring, to keep track of where I was in the hundreds of lines of the reconcile loop

I'm okay to remove them now if they feel like noise more than useful comments

@Julien-Ben Julien-Ben changed the title [DO NOT REVIEW] WIP CLOUDP-347497: Single cluster Replica Set Controller Refactoring Oct 8, 2025
Copy link
Contributor

@m1kola m1kola left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Looks good to me: I didn't pot any issues.

Let a few comments\suggestions, but I do not consider them blocking.

agentCertPath: agentCertPath,
agentCertHash: agentCertHash,
currentAgentAuthMode: currentAgentAuthMode,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Nit: I think a long list of arguments is better than a struct. It is easier to pass a struct around, but arguments are more explicit: you only pass what you need in a given function. With plain arguments it is easier to spot unused arguments as well.

There is also no question like - why some cert related stuff is in the struct, but tlsCertPath and internalClusterCertPath is not?

I would leave arguments as is.

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Mikalai here, it is a bit confusing

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

That's a valid concern, I hesitated and did it mostly to match what we do in the sharded cluster controller:

type deploymentOptions struct {
	podEnvVars           *env.PodEnvVars
	currentAgentAuthMode string
	caFilePath           string
	agentCertPath        string
	agentCertHash        string
	certTLSType          map[string]bool
	finalizing           bool
	processNames         []string
	prometheusCertHash   string
}

If we decide to keep the struct, I will put more variables into it for sure while working on the unification, for more consistency.

I'm happy to hear the team's opinion about this

Copy link
Contributor

@lucian-tosa lucian-tosa left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

LGTM, just a couple of questions

agentCertPath: agentCertPath,
agentCertHash: agentCertHash,
currentAgentAuthMode: currentAgentAuthMode,
}
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I agree with Mikalai here, it is a bit confusing

Copy link
Contributor

@anandsyncs anandsyncs left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Left some thoughts on further refactoring and a couple of nits.

prometheusCertHash, err := certs.EnsureTLSCertsForPrometheus(ctx, r.SecretClient, rs.GetNamespace(), rs.GetPrometheus(), certs.Database, log)
if err != nil {
log.Infof("Could not generate certificates for Prometheus: %s", err)
return r.updateStatus(ctx, rs, workflow.Failed(err), log)
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Suggested change
return r.updateStatus(ctx, rs, workflow.Failed(err), log)
return r.updateStatus(ctx, rs, workflow.Failed(xerrors.Errorf("could not generate certificates for Prometheus: %w", err)), log)

other errors are more descriptive, is it better to clarify this one too?

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

good catch, done

databaseContainer := container.GetByName(util.DatabaseContainerName, currentSts.Spec.Template.Spec.Containers)
volumeMounts := databaseContainer.VolumeMounts

if !mdb.Spec.Security.IsTLSEnabled() && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

We probably need a nil check on .Security

Copy link
Collaborator Author

@Julien-Ben Julien-Ben Oct 16, 2025

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I took it from the common controller, I haven't written it myself, so I will mark it as follow up, but I think we have guardrails against nil Security at initialization, otherwise we would have observed panics already 😅

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find any initialisation. Most of the time we do mdb.GetSecurity() which returns an empty Security value if mdb.Spec.Security is nil.

// updateOmDeploymentRs performs OM registration operation for the replicaset. So the changes will be finally propagated
// to automation agents in containers
func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, set appsv1.StatefulSet, log *zap.SugaredLogger, agentCertPath, caFilePath, tlsCertPath, internalClusterCertPath string, prometheusCertHash string, isRecovering bool, shouldMirrorKeyfileForMongot bool) workflow.Status {
func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(ctx context.Context, conn om.Connection, membersNumberBefore int, rs *mdbv1.MongoDB, log *zap.SugaredLogger, tlsCertPath, internalClusterCertPath string, deploymentOptionsRS deploymentOptionsRS, shouldMirrorKeyfile bool, isRecovering bool) workflow.Status {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

updateOmDeploymentRs is carrying a lot of responsibilities: waiting for agents, TLS disable coordination, replica-set building, authentication sync, automation-config reconciliation, and cleanup.
 

func (r *ReconcileMongoDbReplicaSet) updateOmDeploymentRs(
ctx context.Context,
conn om.Connection,
membersBefore int,
rs *mdbv1.MongoDB,
log *zap.SugaredLogger,
tlsCertPath, internalClusterCertPath string,
opts deploymentOptionsRS,
shouldMirrorKeyfile, isRecovering bool,
) workflow.Status {
inputs, status := r.prepareOmDeploymentInputs(ctx, conn, membersBefore, rs, tlsCertPath, opts, shouldMirrorKeyfile, isRecovering, log)
if !status.IsOK() {
return status
}
 
if status := r.applyReplicaSetAutomation(ctx, conn, rs, inputs, internalClusterCertPath, log); !status.IsOK() {
return status
}
 
return r.finalizeOmDeployment(ctx, conn, rs, membersBefore, inputs, internalClusterCertPath, isRecovering, log)
}

 
Each helper focuses on one concern:
 

type omDeploymentInputs struct {
replicasTarget           int
replicaSet               replicaset.ReplicaSet
processNames             []string
prometheusConfiguration  PrometheusConfiguration
additionalReconcileNeeded bool
}
 
func (r *ReconcileMongoDbReplicaSet) prepareOmDeploymentInputs(...) (omDeploymentInputs, workflow.Status) { /* wait for agents, handle TLS disable, build replicaSet */ }
 
func (r *ReconcileMongoDbReplicaSet) applyReplicaSetAutomation(...) workflow.Status { /* auth reconciliation + ReadUpdateDeployment */ }
 
func (r *ReconcileMongoDbReplicaSet) finalizeOmDeployment(...) workflow.Status { /* wait for ready, log rotate, host diff, backup */ }

Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

You can also use parameters instead of omDeploymentInputs

Copy link
Collaborator Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Same remark as in the other comment, I'll mark it as follow up to not increase the size of the PR, since I just slightly modified the method here.
But if I have time during the unification implementation, I'll make sure to do it ! Thanks for the suggestion

databaseContainer := container.GetByName(util.DatabaseContainerName, currentSts.Spec.Template.Spec.Containers)
volumeMounts := databaseContainer.VolumeMounts

if !mdb.Spec.Security.IsTLSEnabled() && wasTLSSecretMounted(ctx, getter, currentSts, mdb, log) {
Copy link
Contributor

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I don't find any initialisation. Most of the time we do mdb.GetSecurity() which returns an empty Security value if mdb.Spec.Security is nil.

@Julien-Ben Julien-Ben merged commit c10af85 into master Oct 20, 2025
38 checks passed
@Julien-Ben Julien-Ben deleted the jben/rs-controller-refactor-clean-rebase branch October 20, 2025 10:20
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

skip-changelog Use this label in Pull Request to not require new changelog entry file

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants